An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace
To minimize the effect of outliers, kNN ensembles identify a set of observations closest to a new sample point and estimate its unknown class by majority voting on the labels of the training instances in that neighbourhood. Ordinary kNN-based procedures determine the k closest training observations in the neighbourhood region (enclosed by a sphere) using a distance formula. The k nearest neighbours procedure may fail when the test points follow the pattern of nearest observations lying along a path that is not contained in the given sphere of nearest neighbours. Furthermore, these methods combine hundreds of base kNN learners, many of which might have high classification errors, resulting in poor ensembles. To overcome these problems, an optimal extended neighbourhood rule based ensemble is proposed, where the neighbours are determined in k steps. The rule starts from the sample point nearest to the unseen observation; the second data point selected is the one closest to the previously selected point. This process continues until the required k observations are obtained. Each base model in the ensemble is constructed on a bootstrap sample in conjunction with a random subset of features. After building a sufficiently large number of base models, the optimal models are selected based on their performance on out-of-bag (OOB) data.

Comment: 12 pages
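As a concrete illustration of the two-stage procedure described above, below is a minimal Python sketch, assuming Euclidean distance and majority voting; the function and parameter names (enr_neighbours, n_keep, etc.) are our own illustrative choices, not the authors' implementation.

    import numpy as np

    def enr_neighbours(X_train, x_test, k):
        # Extended neighbourhood rule: take the training point closest to
        # x_test, then repeatedly take the unused point closest to the
        # previously selected one, giving a chain of k neighbours.
        available = list(range(len(X_train)))
        anchor = x_test
        chain = []
        for _ in range(k):
            dists = np.linalg.norm(X_train[available] - anchor, axis=1)
            nearest = available[int(np.argmin(dists))]
            chain.append(nearest)
            available.remove(nearest)
            anchor = X_train[nearest]  # next step starts from this point
        return chain

    def enr_predict(X_train, y_train, x_test, k):
        # Majority vote over the labels of the chained neighbours.
        idx = enr_neighbours(X_train, x_test, k)
        labels, counts = np.unique(y_train[idx], return_counts=True)
        return labels[np.argmax(counts)]

    def fit_enr_ensemble(X, y, n_models=200, n_feats=5, k=5, n_keep=50, seed=0):
        # Build base ENR learners on bootstrap samples with random feature
        # subsets, score each on its out-of-bag (OOB) rows, and keep the
        # n_keep models with the lowest OOB error (n_keep is an assumption;
        # the abstract only says the optimal models are selected on OOB data).
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(seed)
        n, p = X.shape
        scored = []
        for _ in range(n_models):
            boot = rng.integers(0, n, size=n)                   # bootstrap rows
            feats = rng.choice(p, size=n_feats, replace=False)  # random subspace
            oob = np.setdiff1d(np.arange(n), boot)              # unused rows
            errs = [enr_predict(X[boot][:, feats], y[boot], X[i, feats], k) != y[i]
                    for i in oob]
            scored.append((np.mean(errs) if errs else 1.0, boot, feats))
        scored.sort(key=lambda m: m[0])  # ascending OOB error
        return [(boot, feats) for _, boot, feats in scored[:n_keep]]

    def predict_enr_ensemble(models, X, y, x_test, k=5):
        # Final prediction: majority vote across the selected base models.
        X, y, x_test = np.asarray(X), np.asarray(y), np.asarray(x_test)
        votes = [enr_predict(X[boot][:, feats], y[boot], x_test[feats], k)
                 for boot, feats in models]
        labels, counts = np.unique(votes, return_counts=True)
        return labels[np.argmax(counts)]

Selecting base models by their OOB error mirrors the idea of discarding weak learners before the final vote, so that only the well-performing chains contribute to the ensemble prediction.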
Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for TumorC dataset.
Brief description of the datasets along with the corresponding number of features, observations, class-wise distributions and sources.
Classification error rates produced by different methods on simulated data.
Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Colon dataset.
Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for Breastcancer dataset.
Box-plots of the error rates produced by random forest, using top 10 features selected by different feature selection methods for DLBCL dataset.
Classification error rates produced by different methods on various subsets of genes.
Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Lungcancer dataset.
Bar-plots of error rates of the proposed and the other classical methods on various subsets of genes for Prostate dataset.